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When the amount of entanglement in a quantum system is limited, the relevant dynamics of the 
system is restricted to a very small part of the state space. When restricted to this subspace the 
description of the system becomes efficient in the system size. A class of algorithms, exemplified by 
the Time-Evolving Block-Decimation (TEBD) algorithm, make use of this observation by selecting 
the relevant subspace through a decimation technique relying on the Singular Value Decomposition 
(SVD). In these algorithms, the complexity of each time-evolution step is dominated by the SVD. 
Here we show that, by applying a randomized version of the SVD routine (RRSVD), the power 
law governing the computational complexity of TEBD is lowered by one degree, resulting in a 
considerable speed-up. We exemplify the potential gains in efficiency at the hand of some real world 
examples to which TEBD can be successfully applied to and demonstrate that for those system 
RRSVD delivers results as accurate as state-of-the-art deterministic SVD routines. 


I. INTRODUCTION 

The description of the physics of quantum many-body 
systems suffers from the ’’curse of dimensionality”, that 
is, the size of the parameter set that is required to achieve 
an exact description of the physical state of a quantum 
many-body system grows exponentially in the number 
of its subsystems. Therefore, the simulation of quantum- 
many-body systems by classical means appears to require 
an exponential amount of computational resources which, 
in turn, would impose severe limitations on the size of 
the quantum many-body systems that are amenable to 
classical simulation. 

On the other hand, exactness of description may be 
traded for an approximate representations of the state 
and dynamics of a quantum many-body system for as 
long as the quality of the approximation can be controlled 
and increased at will and whenever such an approximate 
treatment results in a polynomial scaling of the resources 
with the system size. Clearly, this will not be possible 
for all system dynamics as it would imply the classical 
simulability of quantum computers which is generally not 
believed to be the case. 

One specific setting of considerable practical impor¬ 
tance that allows for efficient approximate description of 
quantum many-body systems concerns systems whose en¬ 
tanglement content is limited. Indeed, in pure states that 
factor, at all times, into a product of states; each involv¬ 
ing a number of qubits that is bounded from above for all 
times by a constant [1] can be simulated efficiently on a 
classical device. Going beyond this, it is well-known that 
the states of I-D quantum systems often obey an area law 
[2-6] which represents a severe limitation of their entan¬ 
glement content. This can be made use of, as the state 
of such slightly entangled 1-D quantum many-body sys¬ 
tem in a pure state can be described efficiently by means 
of the Density Matrix Renormalization Group (DMRG) 


[7]. It was noticed early on [8] that DMRG amounts to 
the approximation of the state of the system by a matrix 
product state [9] which can be constructed systematically 
in terms of a consecutive Schmidt decomposition. 

This approach can be extended to the dynamics of one¬ 
dimensional quantum many-body systems. The time- 
dependent DMRG (t-DMRG) [10], the time-dependent 
Matrix Product States (t-MPS) [11], and the Time- 
Evolving Block-Decimation (TEBD) [12, 13] are all algo¬ 
rithms based on the idea of evolving an MPS [9] in time. 
In settings in which only a restricted amount of entan¬ 
glement is present in the system, these methods are very 
efficient and the cost of simulation scales merely polyno- 
mially in the system size. A very clear presentation of 
these three algorithms as well as a discussion of the main 
differences between them can be found in [14]. 

All these algorithms rely in an essential way on the sin¬ 
gular value decomposition (SVD) which, due to its com¬ 
putational complexity, represents a bottleneck in their 
implementations. In this work we show how to address 
this issue within the TEBD framework by replacing the 
SVD which is used to restrict the Hilbert space of quan¬ 
tum states relevant to the system dynamics, with the 
Reduced-Rank Randomized SVD (RRSVD) proposed in 
[15]. We show that when the Schmidt coefficients decay 
exponentially, the polynomial complexity of the TEBD 
decimation step can be reduced from 0{n^) to 0{Tn?) 
with n indicating the linear dimension of the square ma¬ 
trix to decompose. This results in a considerable speed¬ 
up for the TEBD algorithm for real world problems, en¬ 
abling the access of larger parameter spaces, faster simu¬ 
lation times, and thus opening up new regimes that were 
previously inaccessible. 

The paper is organized as follows: Section II provides 
a brief description of the TEBD algorithm for pure and 
mixed state dynamics in order to make the manuscript 
self-contained for the non-specialist and to identify and 
highlight the crucial step in which the RRSVD routine 
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can be applied for considerable benefit. Section III in¬ 
troduces a specific and challenging application of TEBD, 
the TEDOPA scheme, which maps an open quantum sys¬ 
tem on a one-dimensional configuration, allowing for the 
efficient simulation of its dynamics. Section IV the pro¬ 
ceeds with a description of the salient features of the 
RRSVD algorithm and a discussion of the speed-up that 
it provides over the standard SVD. Benchmarking in the 
TEBD context with applications to TEDOPA along with 
stability analysis are presented in section V. The last sec¬ 
tion is devoted to conclusions and outlook. 


II. AN ALGORITHM FOR ONE-DIMENSIONAL 
QUANTUM SYSTEMS 

The time evolving block decimation (TEBD) is an al¬ 
gorithm that generates efficiently an approximation to 
the time evolution of a one-dimensional system subject 
to a nearest-neighbor Hamiltonian. Under the condi¬ 
tion that the amount of entanglement in the system is 
bounded a high fidelity approximation requires polyno- 
mially scaling computational resources. TEBD does so 
by dynamically restricting the exponentially large Hilbert 
space to its most relevant subspace whose size is scaling 
polynomially in the system size, thus rendering the com¬ 
putation feasible [13, 14]. 

TEBD is essentially a combination of an MPS descrip¬ 
tion for a one-dimensional quantum system and an al¬ 
gorithm that applies two-site gates that are necessary 
to implement a Suzuki-Trotter time evolution. Together 
with MPS operations such as the application of measure¬ 
ments this yields a powerful simulation framework [9]. 

While originally formulated for pure states, an exten¬ 
sion to mixed states is possible by introducing a matrix 
product operator (MPO) to describe the density matrix, 
in complete analogy to an MPS describing a state [16]. 
The simulation procedure remains unchanged, except for 
details such as a squaring of the local dimension on each 
site, the manner in which two-site gates are built, the 
procedures to achieve normalisation as well as the im¬ 
plementation of measurements [14, 16]. While standard 
MPO formulations cannot ensure positivity of the state 
efficiently, recent reformulations can account for this fea¬ 
ture too [17]. Important for the present work are implica¬ 
tions of these modifications on the two-site update - and 
while the numerical recipe does not change, its scaling 
does. 


A. Introduction to MPS 

The remarkable usefulness and broad range of appli¬ 
cations of the MPS description for quantum states has 
been widely recognized early on in the development of 
DMRG algorithms [8]. To better understand the full ex¬ 
tent of the presented work, we highlight the relevant key 


features of MPS, referring to references [9, 14] for a full 
account. 

Let us introduce an MPS for a pure state of N sites. 
Eor simplicity we assume that each site has the same 
number of dimensions d, the extension to varying dimen¬ 
sion is straight forward. The MPS then relates the ex¬ 
pansion coefficients in the Pock basis to a set of 

N ■ d matrices P and N — 1 matrices A 


^) = H 




*1*2 • • • *Ar ) 


(1) 


= r[i]*i . y[i] . p[2]*2 . _ . 

...A[^-i]r[^]*«|iif2---iiv). (2) 


Each of the N sites is assigned a set of d matrices P 
which have dimension xi x Xr- The index k in square 
brackets denotes the corresponding site and the ik the 
corresponding physical state. The diagonal Xb x Xb ma¬ 
trices A are assigned to the bond k between sites k and 
fc -I- 1. The structure of the MPS is such that the ma¬ 
trices A contain the Schmidt values for a bipartition at 
this bond. The matrices P and A are related to the coef¬ 
ficients c by 

• Aih . pPii^..... . ri^hiv (3) 

with matrix dimensions Xh Xrj and Xb for all T and A 
such that building the product yields a number. The 
main reason for employing an MPS description is the re¬ 
duction from d^ coefficients c to only O (^dNx^) when 
the matrices P and A are at most of size x x y. This de¬ 
scription is efficient, provided that the matrix size y (also 
known as the bond dimension) is restricted - which it is, 
if the amount of entanglement in the system is bounded. 
Eurther the MPS structure entails that local gates ap¬ 
plied to the whole state only change the matrices of the 
sites they act on - thus updates are inexpensive, as op¬ 
posed to a full state vector description. 


B. Two-site gates 

The crucial ingredient in this simulation scheme is the 
SVD. It is the solution to the question of how to apply a 
two-site gate to an MPS or MPO. 

The fact that a gate G acting on the two sites k and 
fc -I- I only changes the matrices local to these sites can 
easily be seen from 

Gk,k+i\ ^)= •... • (4) 

I *1 • • ■ ik-i )Gk,k+i \ "ikik+i ) \ ik+2 ■■ - In) 

= E E r[r]*i . y[i] ..... 

il-.-lN jkJk+1 

(*fc*fc+i |Gfc.fc+i| jkjk+i )\ii ■ ■ Gn ). (5) 
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Above we first introduced the completeness relation for 
jk and jk+h followed by switching indices ik with jk and 
ik+i with jk+i- Identifying all terms related to jk and 
jk+i now defines a tensor 0 of fourth order 

e{ik,ik+ua,f3) = 

7 

( 6 ) 

which can be built at a numerical cost of 
O (dk • dk+i • X^) basic operations. This tensor needs to 
be updated when applying the gate G. The update rule 
from Eq. (5) yields the relation 

0 ijk^ 7 +I 5 0^5 /5) — ^ ^ { ‘ik'^k+1 jkjk-\-l ) ‘ 

jk,jk+l 

■ <d{jk,jk+i,a,f3). (7) 

This sum is performed for all a and (3 - which in gen¬ 
eral run from 1 to y. Thus there are dk x dk+i products, 
which gives the number of basic operations for the update 
of 0 as O (df ■ dl ■ x^) [12]- This formula however only 
enables the update of the complete matrix 0 and not of 
the single entities A^^l and They still have to 

be extracted from these updated products. To do this, 
0 is written in a blocked index form Q(dkx),{dk+ix) which 
then is singular value-decomposed [12]. The general sin¬ 
gular value decomposition (SVD) scales as O (m ■ for 
a m X n-matrix, thus resulting in an overall computa¬ 
tional cost of O {di ■ d‘1 ■ x^) ■ This makes the SVD the 
real bottleneck in the simulation, consuming by far the 
most resources, and therefore the first target for improve¬ 
ments. 

Here we stick to the (unofficial) standard notation for 
MPS parameters in the context of the TEBD algorithm 
and denote the diagonal matrices as well as their entries 
by A. During our discussion of the singular value decom¬ 
position though we switch to the respective notational 
standards where the i’th singular value will be denoted 
by CTi - which however is the same quantity as the A^ in 
this section. 


C. Error analysis 


straight forward and illustrative example is the standard 
3rd-order expansion 


exp {iH6t) = exp {i (F + G) 6t) = 


( 8 ) 


= exp i-F6t ■ exp {iG6t) ■ exp i-Fdt -I- 0{5t) . 
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This leads to three sweeps {\F, G, ^F) of non¬ 
overlapping gate applications, comprising one time step. 
Various higher-order schemes for such decompositions ex¬ 
ists, differing in the number of sweeps and the order of the 
resulting error [19]. However, these kind of schemes may 
introduce non-orthogonal components since the order of 
gate applications is not successive. This can be circum¬ 
vented by resorting to schemes that produce sweeps with 
ordered, successive gate applications [20]. 

A decomposition to order p introduces an error of order 
^st = . The error incurred in one time step in 

general scales linearly with the system size N. This is 
due to the nested commutators occurring in the error 
term of the Suzuki-Trotter decomposition as can be seen 
when applying the Baker-Campbell-Hausdorff formula. 
Since the number of time steps taken is the total time 
T divided by the number of time steps T/6t, the total 
Trotter error Ctrotter is of order O {{6t)^ NT) [21]. 

The second considered error source is the truncation 
error. It stems from the truncation of the Schmidt values 
during the application of a two-site gate. Employing the 
Schmidt decomposition, a bipartite state can be written 
as 


X X 

2—1 

= I V'trunc ) + I V'-L ) (9) 

where the left sum until x' denotes the kept part | V'trunc) 
and the right sum starting from -I- 1 denotes the dis¬ 
carded part I V'-L )■ Due to the fact that the | ) are 

mutually orthogonal (as are those of the right subsys¬ 
tem), the discarded part is orthogonal to the retained. 
Given that the squared Schmidt values sum up to 1, this 
truncation leads to a deviation in the norm of the state 


During a TEBD simulation, the two main error sources 
are the Trotter and the truncation error. Other small er¬ 
ror sources, often depending on implementational or algo- 
rithmical choices, are neglected in the following analysis. 

TEBD relies heavily on the nearest-neighbor structure 
of the underlying Hamiltonian to implement time evo¬ 
lution. Traditionally this is done by standard Suzuki- 
Trotter decompositions [18] where the total Hamiltonian 
is split into two terms H = F + G with each F and G 
are the sum over all even and odd (respectively) terms of 
the Hamiltonian. This way all terms within F (G) com¬ 
mute with each other, incurring no error while applying 
an operator of the form exp (oF) {a € C). The most 


X 

( V^trunc I '^trunc ) ~ 1 ^ ^ A^ =1 W (19) 

i=X' + i 

where we defined the discarded weight w = X]7x'-i-i ■ 

Thus when renormalizing | V’trunc ) we pick up a factor 
of \/ {1 — w). Thus upon n truncations we are off by a 
factor of about (1 — re)”* with n* being the number of 
truncations performed. Truncating each bond in each 
time step results in tit (x IjT and thus the truncation 
error is about 


NT 

^trunc ~ (1 


= exp 




( 11 ) 
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Thus we end up with a careful balancing of the two 
errors, depending on the size of the time step 6t. For 
smaller 6t we have a smaller truncation error. Yet this 
requires more truncations due to the larger number of 
time steps taken and thus in a larger truncation error. 


III. AN ADVANCED APPLICATION OF THE 
TEBD ALGORITHM 

The TEBD algorithm is remarkably useful also in sce¬ 
narios which at first seem to be quite different from quan¬ 
tum many-body systems. One such example is its us¬ 
age in the time evolving density matrix using orthogonal 
polynomials algorithm (TEDOPA) capable of treating 
open quantum systems. We briefly present the TEDOPA 
scheme to show in which regimes RRSVD proves to be 
most useful and how to speed-up previous simulations 
and refer to [22-24] for a more detailed presentation of 
the algorithm. 

TEDOPA is a certifiable and numerically exact method 
to treat open quantum system dynamics [22, 24, 25]. It 
acts upon a spin-boson model description of an open 
quantum system where a central spin interacts linearly 
with an environment modelled by harmonic oscillators. 
In a two-stage process TEDOPA then first employs a uni¬ 
tary transformation reshaping the spin-boson model into 
a one-dimensional configuration. In a second step this 
emerging configuration is treated by TEBD, exploiting 
its full simulation power for one-dimensional systems. 

The total Hamiltonian is split into system, environ¬ 
ment and interaction part 

H — Hsys -b i?env + (12) 

^^max 

-^env — / 9 (^) 

^0 

^^max 

^^int = / dxh (x) {al + a^) A. (14) 

The bosonic creation and annihilation operators 
and Ux fulfill the usual bosonic commutation relations 
for the environmental mode x. The function g (x) can 
be identified with the environmental dispersion relation; 
the function h (x) gives the system-environment coupling 
strength for mode x between its displacement (al + Ux) 
and the operator A acting on the system. 

Here the functions g (x) and h(x), together with the 
temperature, uniquely characterize an environment and 
define the spectral density J (oj) given by 

J (uj) = TTh^ [g-^ (a;)] . (15) 

The interpretation of the quantity {dg~^ (cu) /duj) Suj is 
the number of quantised modes with frequencies between 
w and UJ + duj (for Jw —)■ 0). Further hold g the relations 
9~^[9ix)] =g[g~^{x)] = x. 


Then new oscillators with creation and annihilation 
operators and can be obtained by defining the an¬ 
alytical transformation Un (x) as 

Un (x) = h {x) pn {x), (16) 

^^max 

= / dxUn{,x)a\. (17) 

Jo 

It utilizes the orthogonal polynomials (x) defined 
with respect to the measure dp (x) = (x) dx. While in 

certain cases it is possible to perform this transformation 
analytically, in general a numerically stable procedure is 
used [22, 23, 26]. 

This transformation yields a semi-infinite one¬ 
dimensional nearest-neighbor Hamiltonian 

OO 

H —Hsys + tfjA (&0 + A ^nb\j}n 

n—1 

OO 

+ ^ ) tn (^n^ra+1 + (18) 

n—1 

whose nearest-neighbor geometry (which is necessary 
for the application of TEBD) as well as coefficients 
and tn are directly related to the recurrence coefficients 
of the three-term recurrence relation defined by the or¬ 
thogonal polynomials Pn (x) with respect to the measure 
dp (x) = (x) dx [23]. 



FIG. I: Illustration of the spin-bonson model’s 
transformation into a one-dimensional configuration 
where the system is only coupled to the environment’s 
first site. 


This transformation of the configuration is depicted in 
Fig. 1, from the spin-boson model on the left to a one¬ 
dimensional geometry on the right. In a last step it is nec¬ 
essary to adjust this configuration further to suit numer¬ 
ical needs. The number of levels for the environment’s 
oscillator on site k can be restricted to dk^max to reduce 
required computational resources. A suitable value for 
dk,max is related to this site’s average occupation, de¬ 
pending on the environment’s structure and temperature. 
The number of sites that is required to faithfully repre¬ 
sent the environment has to be sufficiently large to com¬ 
pletely give the appearance of a “large” reservoir - one 
that avoids unphysical back-action on the system due 
to finite-size effects that lead to reflections at the sys¬ 
tem boundaries (see [27] for extensions of TEDOPA that 




5 


can alleviate this problem considerably). It should be 
noted that these truncations, while feasible numerically 
and easily justifiable in a hand-waving kind of argument, 
in the end can also be rigorously certified by analyti¬ 
cal bounds [25]. These adjustments yield an emerging 
system-environment configuration which is now conve¬ 
niently accessible by TEBD. 


IV. REDUCED-RANK RANDOMIZED SVD 


The Singular Value Decomposition (SVD) is at the 
heart of the MPS representation and MPS-based algo¬ 
rithms, such as TEBD. The efficiency of TEBD comes 
from the possibility of approximating states living in an 
exponentially large Hilbert space with states defined by a 
number of parameters that grows only polynomially with 
the system size. In order to understand why the SVD 
plays such a crucial role, we introduce the following prob¬ 
lem: given a complex m x n matrix A, provide the best 
rank-fc (fc < n) approximation of A. Without loss of gen¬ 
erality we suppose m > n and rank{A) = n. The solu¬ 
tion to this problem is well known [28]: first compute the 
Singular Value Decomposition oi A = C/EPl, where U = 
U= are the 

left and right singular vectors of A respectively and 
S = diag (cti, a 2 , ■ ■ ■, <Jn) with cti > cr 2 > ... > a„. Then 
retain the first k largest singular values (cti, a 2 , ■ ■ ■ ,<Jk) 
and build the matrices Uk = and 

14 = The matrix Ak = U^Vl 

satisfies 


||4 — Afclli? = 


\ 






min \\A-A'\\f, 

rank{A' )—k 


(19) 


where || • \ \f indicates the Frobenius norm 


F = 


\ 


E' 


( 20 ) 


In other words, Ak provides the best rank-fc approxima¬ 
tion of A. This result justifies the application of SVD for 
the TEBD decimation step. 

The computational complexity of the SVD of A is 
0{m ■ vf). For large matrices, the SVD can therefore 
require a significant amount of time. This is a crucial 
point since every single TEBD simulation step of an N- 
sites spin chain requires O (N) SVDs which usually con¬ 
sumes about 90% of the total simulation time. 

As discussed in section II, the bond-dimension y re¬ 
quires discarding of n — y singular values (and corre¬ 
sponding left-/right- singular vectors). In the TEBD 
two-site update step, for example, we keep only y singu¬ 
lar values out of n = d • y (pure states) or n = df • x 


(mixed states). Most of the singular values and left- 
/right-singular vectors are therefore discarded. It means 
that we are investing time and computational resources 
to compute information that is then wasted. 

It is possible to avoid the full SVD of A and com¬ 
pute only the first k singular values and correspond¬ 
ing singular vectors by using Truncated SVD methods; 
such methods are standard tools in data-classification al¬ 
gorithms [29], signal-processing [30] and other research 
fields. The Implicitly Restarted Arnold! Method [31, 32] 
and the Lanczos-Iteration [33] algorithms, both belong¬ 
ing to the Krylov-subspace iterative methods [34, 35], are 
two examples. 

The Reduced-Rank Singular Value Decomposition 
(RRSVD), originally presented in by N. Halko et al. [15], 
is a randomized truncated SVD. It is particularly well 
suited to decompose structured matrices, such as those 
appearing in the TEBD simulation of non-critical quan¬ 
tum systems. Most interestingly, the algorithm is insensi¬ 
tive to the quality of the random number generator used, 
delivers highly accurate results and is, despite its ran¬ 
dom nature, very stable: the probability of failure can 
be made arbitrarily small with only minimal impact on 
the necessary computational resources. 

In what follows we will describe the RRSVD algorithm 
and report the main results on stability and accuracy. 
For a full account on RRSVD we refer the reader to [15]. 

The RRSVD algorithm is a two-step procedure. The 
first step constructs an orthogonal matrix Q whose 
columns constitute a basis for the approximated range 
of the input matrix A. In the second step, the approx¬ 
imated SVD of A is computed by performing a singular 
value decomposition of 

The approximation of the range can be done either for 
fixed error tolerance e, namely by finding a such that 

\\A-Q,QlA\\<e, (21) 

or for a fixed rank k of the approximation, that is by 
finding Qk such that 

min \\A-X\\!=s\\A-QkQlA\\. (22) 

rank{X)<k 


The first one is known as the fixed-precision approxima¬ 
tion problem^ whereas the second is known as the fixed- 
rank approximation problem. Here and in what follows, 
we indicate by jjAjj the operator norm, corresponding to 
the largest singular value of A. 

If aj is the j-th largest singular value of A then 


min [[A — Vjji? = 

rank(X)<k{e) 


= ||A - Qk(e)Q\(^)A\\F = 


\l ^ 

\ i=/c(e) + l 




(23) 


where <Jk(€) is the first singular value > e and the columns 
of the rank-fc(e) matrix Q are the first k{e) left singular 
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vectors of A. However, this would require the knowledge 
of the first k(e) singular values and vectors of A. 

For the sake of simplicity, let us focus initially on the 
fixed-rank approximation problem. In order to determine 
Qk, we resort to randomness. More precisely, we use a 
sample of k + p random vectors whose components 
are independently drawn form a standard Gaussian dis¬ 
tribution A/’(^=o,o-=i)• The set will be with very 

high probability a set of independent vectors. The param¬ 
eter p determines the amount of oversampling needed to 
make the rank-A: approximation of A more precise. The 
TO X (k + p) matrix 


Y = An, 

where n = ..., will therefore have full 

rank (fc +p). By re-orthogonalizing the columns of the 
matrix Y we obtain a basis for the rank-(fc + p) approx¬ 
imation of the range of A. The re-orthogonalization can 
be done by using the Qi?-decomposition Y = QR. If 
{k + p) < n, the computational cost of the first step 
of this algorithm is dominated by the matrix multipli¬ 
cation HO: this operation has an asymptotic complex¬ 
ity 0{mn{k + p)), whereas the QR decomposition of 
the TO X ik + p) matrix Y has asymptotic complexity 
0{m{k+pf) . 

When the input matrix A is very large, the singular 
vectors associated with small singular values may inter¬ 
fere with the calculation. In order to reduce their weight 
relative to the dominant singular values it is expedient 
to take powers of the original matrix A. So, instead of 
computing Y = An we compute 

z = Bn = (AA^YAn. ( 24 ) 

The singular vectors of B = {AA^^^A are the same as 
the singular vectors of A\ for the singular values of H, on 
the other hand, it holds: 

= (25) 

which leads to the desired reduction of the influence on 
the computation of the singular vectors associated to 
small singular values. This “stabilizing” step, also re¬ 
ferred to as Power Iteration step (PI), increases the com¬ 
putational cost of the first step of the RRSVD algorithm 
by a constant factor {2q + 1). 

A side effect of the PI is the extinction of all informa¬ 
tion pertaining to singular vectors associated to small 
singular values due to the finite size of floating point 
number representation. In order to avoid such losses, we 
use intermediate re-orthogonalizations (see Algorithm 1). 
We point out that in the typical context where TEBD is 
successfully applied, the singular values of the accord¬ 
ing matrices decay very fast, so the PI scheme must be 
applied. 

Since the RRSVD is a randomized algorithm, the qual¬ 
ity of the approximation comes in the form of expecta¬ 
tion values, standard deviations and failure probabilities. 


We report pertinent results that can be found (including 
proofs) in [15], as well as some particular results tuned 
to cases of specific relevance to applications for TEBD. 

Theorem 1. (Corollary 10.10 in [15]) Given a target 
rank k > 2 and an oversampling parameter p > 2, let Qz 
be the orthogonal to x (k+p) matrix consisting of the fist 
k+p left singular vectors of the matrix Z defined in (24). 
We define Pz = Q\Qz- Then 


E(||(I-Pz)H)||< 

/ 

1-f 


(26) 


< 


< 


k 




1 + 


p-1 J ‘ p 

k e-y/fc -I- p 


E 


j>k 
n l/(2«-el) 




P-1 


P 


An — k 


CTfc+i. 


The application of the PI scheme reduces the average 
error exponentially in q. In order to quantify the de¬ 
viations from the expected estimation error we use the 
following facts: 


\\{I-Pz)AA+A\\{T-Pz)B\\ (27) 

(Theorem 9.2 [15]) and 


\\{I-Py)A\\< 


1 -f 6V(fc + p) •7'log(p)crfc+i -f i^Jk+p 



(28) 

1 

2 


with probability greater or equal to 1— ^ (Corollary 10.9 
[15]). We can now state the following 


Corollary 1. Under the same hypotheses of Theorem 1 
it is 


P I II {T-Pz)A\\ < aAiOk+i + fiAi VcTj ) > 1 - 

(29) 

with a = (1 -I- 6-\/ {k -f p) ■ p\og{p)) and fi = 3\/k + P- 

Proof By applying (28) to B = (HHf)‘?H, and using (25) 
we have 


P 


{T - Pz) B\\ < 


\j>k 


> 1--^. 
pP 


Using the relation (27) we have that 


\\{T-Pz)A\\< 


acT 


2g+l 

fc+1 
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Algorithm 1: Randomized SVD with Power 
Iterations 

Require: m x n matrix A; integers I = k + p (rank of the 
approximation) and q (number of iterations). 

1: Draw an n X I Gaussian matrix Q. 

2: Form Yo — AQ.. 

3: Compute the QR factorization Yq = QqRq. 

4: for j = 1,2,... ,q do 
5: Form Yj = A^Qj-i. 

6: Compute the QR factorization Yj = QjRj- 

7: Form Yj — AQj. 

8: Compute the QR factorization Yj = QjRj- 

9: return Q = Qq. 


since the function f{q) = a}l^,a > 0,a; > 0 is convex, it 
holds 

/ , , i\ 1/(29+1) 

/ / \ 2 \ 





□ 


The results provided by the algorithm are usually 
closer to the average value than those estimated by the 
bound (26), which therefore seems to be not tight. How¬ 
ever, the important message of the preceding results is 
that by applying PI we obtain much better approxima¬ 
tions of A than those provided by the original scheme 
while error expectations and deviations are under full 
control. It would be most convenient to have some means 
to check how close QQ^ A is to the original input matrix 
A. With such a tool, we could check the quality of the 
approximation; moreover, we would be able to solve the 
fixed-error approximation problem (21). In the TEBD 
setting, this would allow us to determine the bond di¬ 
mension y for an assigned value e of the truncation error. 
The solution to this problem comes from this result: 


where = l,...,r are standard normal random vec¬ 
tors. 

Suppose that we have completed the first three steps 
of Algorithm 1. Set Q = Qq, with rank{Q) = I = k p 
and choose the size r of the sample. The Accuracy Check 
algorithm (Algorithm 2) takes in input A, Q, r and e and 
returns a new matrix Q' that satisfies the accuracy bound 
with probability 1 — 10“’’. 

Algorithm 2: Accuracy check 

Require: m x n matrix A; rank-fc -I- p projector Pq = QQ"'-, 
integer r; tolerance e. 

1; do 

2; Set I = k -\- p. 

3: Draw an n x r Ganssian matrix Hr-. 

4; Compute B = Aflr. 

5: Compute D = (/- Pq)AH^ = ..., 

6; Set MAX= max|d<dj’' 

7; if (MAX > e) then 

8: Build Q = Q\B. 

9: Set I = I r. __ 

10; Compute the QR decomposition Q — Q'R'. 

11; Set Q = Q'. 

12; while (MAX > e and I < n — r ) 

13; return Q. 

The computational cost of the Accuracy Check de¬ 
pends on different parameters. The cost of each iteration 
step is 0{m-P). Then we have to consider the iterations. 
If we take r too small (e.g. r = 1), then if the starting 
rank-^ approximation is not good enough, we might need 
many iteration to converge to the desired accuracy. As 
a rule of thumb, we suggest to double the rank of the 
approximation at each time. This will likely lead to over- 
sampling, but still delivers a good performance balance. 

Since the reference metric in TEBD is the Frobenius 
norm eq. (20), some estimate of the Frobenius norm via 
the operator norm is required. To this end we observe 
that the TEBD is successful when the correlations in the 
simulated system are sufficiently short-ranged, i.e. the 
“singular values decay fast enough”. If the entanglement 
between distant parts of the system is non-vanishing, the 
decimation step will lead to an important loss of relevant 
information. As an example let us consider a spin-chain 
of size n and a bipartition A = {1,2,.. .1}, B = [I-\- 
1,... ,n} of the chain. Let \ k)A and | be orthonormal 
bases for subsystems A and B respectively. Then the 
spin-chain state | ip ) can be Schmidt-decomposed as 


Theorem 2. (equation 4-3 supported by Lemma 4-1 [15]) 
WithM= {I-QQ^)A 

P ( ||M|| < low ^ max ] > 1 - 10“”, 

\ V TT i=l,2,...,r j 

(31) 


\ip) = ^Ck,j\k)A\j)B = ^a^\i)A\i)B, 

k,j i 

where in the last equality we used the SVD decomposi¬ 
tion of the matrix C = (cj^k) = to perform the 

Schmidt decomposition oi \i[). The Schmidt coefficients 
CTi are the singular values of C, i.e. the diagonal elements 






Proof. 


of S [36]. The number of non-zero singular values is the 
Schmidt number. The amount of entanglement between 
the subsystems A and B can be quantified by the von 
Neumann, or entanglement, entropy [37] of the reduced 
density matrices pA and pB 

SipA) = log(o-*^) = S{pb). (32) 


The decay rate of the singular value is therefore directly 
related to the amount of entanglement shared between 
two parts of a system. If the system is highly entangled, 
the singular values will decay “slowly”; in the limiting 
case where the system is maximally entangled, the re¬ 
duced density matrices will describe completely mixed 
states and the singular values will be all equal to each 
other. When the system is only slightly entangled, the 
singular values will decay very fast; in particular, if the 
state 1 V’) is separable, the Schmidt number is equal to 1 
(and (Ji = 1). The behavior of the entanglement entropy 
in systems at, or close to, the critical regime has been 
thoroughly studied (see [4] and references therein) to¬ 
gether with its dependence on the decay rate of the eigen¬ 
values of the reduced density matrices pa,b [5, 6, 38, 39]. 
By oversimplifying the problem, we observe that if the 
singular values decay as aj = Ij v(J the entanglement en¬ 
tropy shows, when the system size n —)■ oo, a divergence 
log^(n), whereas if they decay as Uj = 1/j the entan¬ 
glement entropy does converge, since an area law holds 
[4]. When the singular values decay as 1/j, however, the 
truncation error [[Tl — decreases slowly in k and 

the TEBD scheme becomes inefficient since any decima¬ 
tion will lead to considerable approximation errors (see 
Eq. (11)). Eor these reasons we consider the case of lin¬ 
early decreasing singular values as an extremal case for 
the range of applicability of the TEBD scheme. 

This observation provides a useful tool to estimate the 
Frobenius norm, which plays a central role in TEBD, 
through the operator norm. For any matrix A it is: 

11^1! < PIIf < ^/rank{A)\\A\\. 


= ^Tr{A^A) = 

The last term is upper bounded by 




n 


(34) 


CTi lim 

n—>-oo 


E l TT 




□ 

This result finds application in the Accuracy Check 
routine. If we set Pq = QQ^ we find 

11(7 - Pq)A\\f = ^Tr({A-PQA)^ {A-PqA)) 

= ^Tr{A^A)-{A^PQA) 

= ^\\A\\l-\\A^Q\\l 

< UAUf. (35) 

Therefore 

||(/-Pq)A11f<^||A11f (36) 

< ^TT max \\{I — QQ^)Auj^^^\\ 
with probability 1 — 10“”. 

We point out that it is not really necessary to estimate 
the Frobenius norm of the error: given Q we can com¬ 
pute 11(7 — QQ'^)A\\f directly. However one should note 
that this computation of QQ^A requires 2 (to ■n-k) float 
operations instead of the 2(jn • k • r) + m ■ n • r operations 
required to get the error estimate. 

Now that we have the orthogonal matrix Q whose 
columns constitute a basis for the approximated range 
of the input matrix A, we can directly compute the ap¬ 
proximate SVD of A by: 

i) Form the matrix B = Q^A. 


This result holds in general. The inequalities are sat¬ 
urated, in particular, when rank{A) = 1. The upper 
bound on the Frobenius norm is the tighter the closer 
the singular values of A are to each other. If the singular 
values decay at least as 1/j, on the other hand, we have 
the following result. 

Fact 1. Given a rank-n matrix A with singular values 
CTi > 0-2 > ... > cr„ with Oj < oilj,j = l,...,n, it 
holds: 

llAllF<^llAll=ai^ (33) 

where jjAjj indieates the operator norm. 


ii) Compute the SVD of 73: B = T/EW^. 

iii) Form the orthonormal matrix U = QU. 

The product Q^A requires 0{{k+p)-n-m) floating point 
operations. The SVD of 73, under the reasonable assump¬ 
tion that k + p < n, requires 0{n • {k +_p)^) operations. 
The product QU requires 0{m ■ {k + p)^) operations. 

We conclude this section by presenting some results 
on the computational cost of the whole procedure and 
on some more technical aspects related to the implemen¬ 
tation of the algorithm. The asymptotic computational 
cost of partial steps has been given at various places in 
this section. In summary, the real bottleneck of the com¬ 
plete RRSVD algorithm is the first matrix multiplication 
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Y = Ail which has complexity 0{m ■ n ■ {k + p)). The 
value of p can be set in advance or determined by the 
Accuracy Check method described above. All remaining 
operations, such as QR decompositions and error esti¬ 
mation, have smaller computational complexity. If the 
TEBD scheme is applied in a non-adaptive way, i.e. the 
bond dimension is kept fixed at a given value x, we use 
RRSVD to solve the fixed-rank problem. In this case 
RRSVD has complexity 0{m ■ n ■ x)- If the bond di¬ 
mension X is set independently of the input matrix size, 
the replacement of the standard SVD by RRSVD will 
therefore result in a speed-up linear in n. On the other 
hand, if we use an adaptive TEBD simulation where the 
bond dimension is set such that some bound on the ap¬ 
proximation error is satisfied, the cost of RRSVD will 
(strongly) depend on the structural properties of the in¬ 
put matrices. If the singular values decay exponentially 
(short-range correlations), then the expected speed-up is 
roughly the same as for the non-adaptive scheme. If the 
simulated system presents long-range correlations, then 
the speed-up provided RRSVD will be less then linear in 
n, possibly even vanishing. However, TEBD itself is not 
the ideal tool to deal with systems exhibiting long-range 
correlations, so this is only a minor limitation. 

Another crucial observation, related to the implemen¬ 
tation, is that all the operations required by RRSVD 
are standard functions of either the Basic Linear Algebra 
Subprograms (BLAS) [40] or the Linear Algebra PACK- 
age (LAPack [41]) libraries. Both libraries are heavily 
optimized and are available for single-core, multi-core 
(e.g. Intel Math Kernel Library (MKL) [42]) and Kilo- 
processor architectures (e.g. CuBLAS [43] and CULA 
[44]). Since TEBD simulations are usually run on many 
cores (on current cluster architectures often 8 or 16), 
RRSVD can take full advantage of the optimized li¬ 
braries. 


V. RRSVD CASE-STUDY AND PROFILING OF 
REAL TEBD SIMULATIONS 

We start by showing that, despite its random nature, 
RRSVD produces very accurate results with surprisingly 
small fluctuations. To this end we test the described algo¬ 
rithm on a sample of relatively small structured matrices 
extracted from pure-state TEBD simulations of our stan¬ 
dard system from section III, subsequently continuing to 
larger matrices from mixed-state TEBD simulations of 
the same system. We analyze the accuracy of RRSVD 
and its time performances, concluding the section by pre¬ 
senting how RRSVD impacts full TEBD simulations. 


A. Stability analysis 

In order to benchmark the stability of the RRSVD al¬ 
gorithm we consider a set of 7 diagonal 750 x 750 matri¬ 


ces Yim, m = 1,2,... ,7. The matrices Ym are extracted 
from an actual pure-state TEBD simulation of our bench¬ 
mark system described in section III. For every 'Em we 
generate a set of ha = 20 random matrices 
by randomly generating 1500 x 750 random orthonormal 
matrices Um,i and 750 x 750 random orthonormal matri¬ 
ces Vm,k] each Am,i has dimensions 1500 x 750. In this 
way we check for the dependence of both LaPack SVD 
and RRSVD on different matrices with the same struc¬ 
tural properties. In order to take the random nature of 
RRSVD into account, for each Am,i we perform nji = 20 
executions of RRSVD. In this first benchmark we provide 
a rank-50 approximation of the rank-750 original matri¬ 
ces Am,i and show how the accuracy is related to the 
number of subspace iterations g. Motivated by the theo¬ 
rems reported in the previous section, we set p = 50. We 
compare the accuracy and timing results for different val¬ 
ues of the iteration parameter q = 2,4, 6. The accuracy- 
check part of the algorithm is not included here: we do 
not estimate the difference between the original matrix 
A and its projection on the reduced space. 

By running the RRSVD on a set {Ak^iY^Ai random 
realizations of matrices exhibiting the same structure, 
i.e. same singular values E^, we check that the accu¬ 
racy of RRSVD depends only, for fixed numbers of it¬ 
erations q, approximation-rank k and oversampling pa¬ 
rameter p, on the structural properties of the matrices. 
Therefore, in what follows we present an analysis refer¬ 
ring to the instance corresponding to the random realiza¬ 
tion A 2 A, which is, in every respect, a good representative 
of the kind of matrices we deal with when performing a 
TEBD simulations of a pure quantum system far from 
the critical regime. In figure 2(a) we plot the singular 
values of ^ 2 . 1 - R is important to notice that some of 
the largest singular values are very close to each other. 
This is a typical situation in which truncated-SVD meth¬ 
ods belonging to the family of Krylov-subspace iterative 
methods are likely to require more iterations in order to 
accurately resolve the singular values. RRSVD, on the 
other hand, is completely insensitive to this peculiarity 
of the spectrum and provides very good results for the 
whole range of retained singular values {k = 50) start¬ 
ing from <7 = 4: the approximation error is compara¬ 
ble to that of the state-of-the-art LAPack SVD routine. 
Most noticeably, none of the hr executions of RRSVD 
on the instance ^ 2,1 presents outliers, that is to say sin¬ 
gular values that are computed with anomalous inaccu¬ 
racy (figure 2(b)). The behavior of the approximation 
error as a function of q is compatible with the theoret¬ 
ical results stated in the previous section. In Table I 
we show the speed-up tsvo/tRRSVD when both, MKL- 
SVD and MKL-based RRSVD (see section V B for more 
information about the implementation), are executed on 
an Intel Xeon X5570@2.93GHz by 8 concurrent threads. 
The speed-up over the MKL-SVD is obviously decreas¬ 
ing as q and k increase: ior k = p = 100 and q = 6 
almost no advantage remains in applying RRSVD in- 
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k/q 

0 

2 

4 

6 

8 

10 

50 

11.6 

5.4 

3.5 

2.6 

2.04 

1.7 

100 

4.7 

2.3 

1.5 

1.1 

0.89 

0.69 


TABLE I: A 2 ,i- RRSVD Speed-up tsvo/tRRSVD- 
LAPack SVD time: 3.84 s 


stead of the standard SVD. It is worth stressing here that 
RRSVD is meant to deliver a substantial advantage only 
for very large matrices and a comparatively small num¬ 
ber of retained dimensions, as we will show later. We 




(b) 


FIG. 2: Instance ^ 2 , 1 : m = 1500, n = 750; k = p = 50. 
Discarded weight: u> = 4 • 10“^ (a)Base-lO logarithmic 
plot of the singular values the decay appears 
roughly exponential. The inset shows the first 10 
singular values: some of these SVs are very close to 
each other (about 10“®) (b) The errors 
log^o(ICTj — of the RRSVD for each singular 

value and for all ur = 20 executions of RRSVD on the 
same instance matrix A 2 P for different values of the 
iteration number q. The standard MKL SVD routine 
errors are shown as a thick black line. 



FIG. 3: Instance A^. m = 1500, n = 750; k = p = 50. 
The errors log^pda^ — of the RRSVD as in 

figure 2(b) but referring to the singular values of the 
matrix Ac and for q G {0, 2,4, 6, 8,10}. The discarded 
weight w for the chosen value of fc is w = 1 • 10“^. 

now turn our attention to the TEBD extremal case dis¬ 
cussed previously. We consider an to = 1500, n = 750 
random matrix Ac generated, as described at the be¬ 
ginning of this subsection, starting from singular values 
Tic = diag{ai,(72y ■ ■ ■ j'^n) with af = 1/i. As shown in 
figure 3(a), in order to provide the same accuracy deliv¬ 
ered by LAPack on the first k singular values, we need to 
increase the number of iterations. For k = 50 and g = 10 
RRSVD is still able to provide some speed-up over the 
LAPack SVD. But the real problem is the truncation er¬ 
ror: for fc = 50 we have a truncation error of e « 10“^. 
But in order to achieve e < 10“^, about 650 singular val¬ 
ues need to be retained. This results in a major loss of 
efficiency of the TEBD simulation scheme. 

Therefore we can claim that RRSVD is indeed a fast 
and reliable method, able to successfully replace the stan¬ 
dard SVD in the TEBD algorithm in all situations where 
TEBD can successfully be applied. 

B. Performance on larger matrices and TEBD 
profiling 

Now that the basic properties of the algorithm are es¬ 
tablished, we test it on larger matrices sampled from 
mixed-state TEBD simulation of our benchmark system 
from section III. Given the bond dimension y and the 
local dimension d of the sites involved in the two-site- 
update, the size of the matrix given as input to the SVD 
is X In the following example we set A: = y = 100 
and the dimension of the local oscillators to 3,4, 5,6 and 
7 respectively. We therefore present results for matrices 
of dimensions = 900 x 900, d^ = 1600 x 1600, ds = 
2500 X 2500, de = 3600 x 3600 and dj = 4900 x 4900. 
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The structural properties of the test matrices considered 
are similar to the non-critical instances considered in the 
previous subsection. 

We first analyzed the results provided by the RRSVD 
routine on a large sample of matrices (200 instances 
for each dimension). We determined that the RRSVD 
reaches the LAPack accuracy for a number g = 2 of PI 
steps. 

We have developed three versions of the RRSVD al¬ 
gorithm: BL-, MKL- and GPU-RRSVD; each one uses 
a different implementation of the BLAS and LAPack li¬ 
braries. BL-RRSVD is based on a standard single-thread 
implementation (CBLAS [45], LAPACKE [46]); MKL- 
RRSVD uses the Intel® implementation Math Kernel 
Library (MKL [42]); GPU-RRSVD exploits CUBLAS[43] 
and GULA [44], i.e. the BLAS and LAPack libraries for 
Nvidia® Graphics Processing Units (GPUs). RRSVD 
is available for single/double precision real/complex ma¬ 
trices in each version. We refer the reader to [47] for 
more details about our RRSVD implementations. Dur¬ 
ing the completion of this work, another implementation 
of RRSVD from one of the authors of [15] has been re¬ 
ported in [48]. There the authors present three variants 
of RRSVD (essentially RRSVD with and without the PI 
and the Accuracy check) and discuss their performance 
on large (up to 6000 x 12000 ) real matrices with slowly 
decaying singular values. The implementation described 
in [48] is available for single-multi and Kilo processor ar¬ 
chitectures, as ours, but is limited to double precision real 
matrices. We are currently working on a full comparison 
between our and this other version of RRSVD. 

In Table II we show the time required to perform 
the SVD/RRSVD of da, ^ 4 , ^ 5,^6 and dy double-precision 
complex matrices for the three implementations. 

RRSVD provides a speed-up (figure 4) which is con¬ 
sistent with the predictions except in the 16-MKL case, 
there it grows stronger than linearly for matrix sizes in 
the range da —dg. This peculiar behavior is due to the low 
impact of the initial matrix multiplication Y = AVL (and 
subsequent ones) on matrices of such sizes: this opera¬ 
tion is heavily optimized for multi-threaded execution. 
Then the operations that determine the computational 
cost are the QR and final SVD decompositions, which 
have complexity O {m- {k -|-p)^). Since k + p is kept 
constant, we have a quadratic speed-up. This justifica¬ 
tion is supported, for example, by the speed-up scaling 
of the I-MKL case. However, as the matrix size increases 
the speed-up will tend to be linear (see the dy case). The 
performance on RRSVD provided by GPUs are compara¬ 
ble those delivered by the 16-MKL . Indeed, the standard 
SVD is faster on the GPU starting from size dg. This be¬ 
havior was expected, since our test matrices are still too 
small to take full advantage of the Kilo-processor archi¬ 
tecture. 

At last, we perform full TEBD simulations: for each 
dimension di, i = 3,4,5,6,7 we executed one TEBD 


Speed-Up 



EIG. 4: k = p = 100; q = 2. Scaling of the 
SVD/RRSVD speed-up for different platforms as a 
function of the matrix size, as obtained from the data 
reported in Table 11. 


simulation with the standard SVD routine, and another 
TEBD with RRSVD. We ran all the jobs on the same 
cluster node, equipped with two Xeon X5570@2.93GHz 
with 8 cores each, as to assure a fair comparison. In figure 
5 we show the average time required by the TEBD two- 
site update when standard SVD and RRSVD are used. 
The overall speed-up of the two-site update (inset of fig¬ 
ure 5) agrees with the Amdahl law [49]. The average 
is taken over all maximum-sized two-site updates per¬ 
formed in the TEBD simulation (more than 1500 for ev¬ 
ery dimension). When the standard SVD is applied, it 
takes more than 90% of the overall computation time; 
however if RRSVD is employed, SVD-times reduce dras¬ 
tically. Its computational cost is of the same order as 
that of the building of the 0 matrix (cf. figure 5). The 
fluctuations around the average RRSVD time are due to 
the action of the Accuracy-Check: from time to time the 
bond dimension x must be increased in order to keep the 
approximation error below the threshold value e = 10 “^ 
required in the simulation. According to Corollary 1, for 
the choice p = 100 and q = 2, such an increase is moti¬ 
vated only by the decay rate of the singular values: the 
failure probability is smaller than l/IO^™. This is con¬ 
firmed by an a posteriori analysis of the matrices pro¬ 
duced during our test TEBD simulations that required 
an extension of the bond dimension. The rank of the ap¬ 
proximation proposed by RRSVD, for assigned tolerance 
e, can therefore be used to detect either an entanglement 
build-up in the system or an (unlikely) anomalous behav¬ 
ior of the RRSVD itself. 


VI. CONCLUSIONS AND OUTLOOK 

The TEBD-algorithm, an important tool for the nu¬ 
merical description of one-dimensional quantum many- 
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l-BL-SVD 

1-BL-RRSVD 

1-MKL-SVD 

1-MKL-RRSVD 

16-MKL-SVD 

16-MKL-RRSVD 

GPU-SVD 

GPU-RRSVD 

ds 

13.27 

6.62 

1.47 

0.71 

0.46 

0.14 

1.08 

0.25 

di 

262.47 

21.38 

9.31 

1.69 

1.92 

0.36 

3.92 

0.41 

ds 

449.97 

37.19 

31.87 

3.62 

6.07 

0.54 

9.95 

0.61 

ds 

1464.67 

74.48 

97.37 

6.97 

22.93 

0.84 

21.97 

0.88 

dr 

1973.23 

99.9 

241.01 

11.49 

61.51 

1.43 

49.00 

1.48 


TABLE II: Execution time, in seconds, for the SVD of matrices of different sizes with LAPack SVD and RRSVD. 
The parameters for RRSVD are q = 2, k = p = 100. 1-BL: single-thread CBLAS-LAPACKE-based implementation, 
executed on a Intel i7@2.66GHz processor. MKL: MKL-based implementation; 1-MKL: with one MKL thread, 
16-MKL: with 16 MKL-threads, executed on one and two 8-core Xeon X5570@2.93GHz respectively. GPU: 
GUBLAS-GULA implementation executed on a NVIDIA K20s; the timing in this case includes host-to-device and 
device-to-host memory transfers. 



Dimension 


FIG. 5: Profile of the average time required by the 
SVD during a TEBD update step; left bars are for the 
RRSVD, right bars for the standard SVD. The data 
refers to the MKL implementation executed with 16 
MKL-threads. The (unbiased) standard deviation of the 
execution time for the singular value decomposition 
part of the update step is shown as an error bar on top 
of each bar. The inset presents the overall speed-up for 
one complete TEBD two-site update achieved by the 
RRSVD. 


body systems, depends essentially on the SVD which, in 
current implementations represents the bottleneck. We 
have demonstrated that this bottleneck can be addressed 
by the successful application of the RRSVD algorithm 
to TEBD simulations. The block decimation step of the 
TEBD update procedure is now approximately one order 
of magnitude faster than with the standard SVD without 
incurring additional accuracy losses. We note that in our 
test cases we have always chosen the RRSVD parameters 
such that we obtain singular values (and vectors) which 


were as precise as those provided by the standard (LA- 
Pack) SVD routine. By relaxing this requirement, the 
speed-up can be increased further. Moreover, by aug¬ 
menting RRSVD with the Accuracy Gheck feature we are 
able not only to detect any very unlikely deviations from 
the average behavior, but also to understand whether the 
system is experiencing an entanglement build-up which 
would require an increasing of the retained Hilbert space. 

In this paper we focused on the TEBD algorithm and 
its application to the one-dimensional system obtained 
through a TEDOPA mapping of an open quantum sys¬ 
tem (section III). In this context, RRSVD makes it pos¬ 
sible to increase the dimension of the local oscillators 
with a much reduced impact on the computational time, 
thus allowing for the efficient simulation of the system 
at higher temperatures. However, all MPS algorithms 
relying on the SVD to select the relevant Hilbert space 
can greatly benefit from the use of the RRSVD routine, 
as long as the ratio between the number of retained and 
total singular values is sufficiently small. 

The real scope and impact of this new computational 
tool is still to be understood fully. To this end, we 
prepared the RRSVD-Package, a library that provides 
the RRSVD routine for single-/double-precision and real- 
/complex-matrices. The package includes all the different 
implementations (BL, MKL, GPU) of RRSVD and has 
been designed to be plugged into existing codes through 
very minor code modifications: in principle, it suffices 
to replace each call to SVD by a call to RRSVD. This 
should allow for a very quick test of RRSVD in different 
simulation codes and scenarios. The library is written 
in C-|—1-; a Fortran wrapper, that allows to call of the 
RRSVD routine from Fortran code, is included as well. 
Some of the matrices used for the analysis of RRSVD in 
this paper are available, together with some stand-alone 
code that exemplifies the use of RRSVD and how to re¬ 
produce some of the results reported in this paper. The 
RRSVD-Package is freely available at [47] . 

The results obtained for the GPU implementation are 
rather promising: for the largest matrices considered 
(^ 5 ,^ 6 , dr) the GPU performs as well as the cluster node. 
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Preliminary results on even larger matrices show that a 
GPU can become a valid alternative means to perform 
TEBD simulations of system with high local dimensions, 
or when the number of retained dimensions must be in¬ 
creased because of larger correlation lengths. Moreover, 
if operated in the right way, a GPU can act as a large 
cluster of machines [50] without the difficulties stemming 
from the need of distributing the workload among dif¬ 
ferent computational nodes (Message Passing Interface 
(MPI)). A full GPU version of TEBD can make the ac¬ 
cess to super-computing facilities superfluous: a typical 
laboratory workstation equipped with one or two GPUs 
would be sufficient. We are currently re-analyzing the 
TEBD algorithm to expose further options for paral¬ 
lelization, as for example in the construction of the 0 


matrix. It could be computed by an ad-hoc designed 
GUDA-kernel and is a valid target for improvement now 
that its computational complexity is similar to that of 
the SVD. 
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